{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "afd55886-5f5b-4794-838e-ef8179fb0394",
   "metadata": {},
   "source": [
    "##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, The venv running the jupyter lab could be pre-configured with a requirement file that includes the right release. Example for transform developers working from git clone:\n",
    "```\n",
    "make venv\n",
    "source venv/bin/activate && pip install jupyterlab\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "## This is here as a reference only\n",
    "# Users and application developers must use the right tag for the latest from pypi\n",
    "!pip install data-prep-toolkit\n",
    "!pip install data-prep-toolkit-transforms[doc_chunk]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "##### **** Configure the transform parameters. We will only show the use of data_files_to_use and doc_chunk_chunking_type. For a complete list of parameters, please refer to the README.md for this transform\n",
    "##### \n",
    "| parameter:type | value | Description |\n",
    "| --- | --- | --- |\n",
    "|data_files_to_use: list | .parquet | Process all parquet files in the input folder |\n",
    "| doc_chunk_chunking_type: str | dl_json | |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebf1f782-0e61-485c-8670-81066beb734c",
   "metadata": {},
   "source": [
    "##### ***** Import required Classes and modules"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2a12abc-9460-4e45-8961-873b48a9ab19",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dpk_doc_chunk.transform_python import DocChunk"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7234563c-2924-4150-8a31-4aec98c1bf33",
   "metadata": {},
   "source": [
    "##### ***** Setup runtime parameters for this transform and invoke transform"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e90a853e-412f-45d7-af3d-959e755aeebb",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "07:53:11 INFO - doc_chunk parameters are : {'chunking_type': 'dl_json', 'content_column_name': 'contents', 'doc_id_column_name': 'document_id', 'output_chunk_column_name': 'contents', 'output_source_doc_id_column_name': 'source_document_id', 'output_jsonpath_column_name': 'doc_jsonpath', 'output_pageno_column_name': 'page_number', 'output_bbox_column_name': 'bbox', 'chunk_size_tokens': 128, 'chunk_overlap_tokens': 30, 'dl_min_chunk_len': None}\n",
      "07:53:11 INFO - pipeline id pipeline_id\n",
      "07:53:11 INFO - code location None\n",
      "07:53:11 INFO - data factory data_ is using local data access: input_folder - test-data/input output_folder - output\n",
      "07:53:11 INFO - data factory data_ max_files -1, n_sample -1\n",
      "07:53:11 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet'], files to checkpoint ['.parquet']\n",
      "07:53:11 INFO - orchestrator doc_chunk started at 2024-12-11 07:53:11\n",
      "07:53:11 INFO - Number of files is 1, source profile {'max_file_size': 0.011513710021972656, 'min_file_size': 0.011513710021972656, 'total_file_size': 0.011513710021972656}\n",
      "07:53:15 INFO - Completed 1 files (100.0%) in 0.062 min\n",
      "07:53:15 INFO - Done processing 1 files, waiting for flush() completion.\n",
      "07:53:15 INFO - done flushing in 0.0 sec\n",
      "07:53:15 INFO - Completed execution in 0.062 min, execution result 0\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "DocChunk(input_folder='test-data/input',\n",
    "        output_folder='output',\n",
    "        doc_chunk_chunking_type= \"dl_json\").transform()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
   "metadata": {},
   "source": [
    "##### **** The specified folder will include the transformed parquet files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "7276fe84-6512-4605-ab65-747351e13a7c",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['output/metadata.json', 'output/test1.parquet']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import glob\n",
    "glob.glob(\"output/*\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "50fa6c21-ff73-4e9e-88a0-6fc85a5ef8e0",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
